Xeno-Canto Dataset Classification

Import Libraries

Setting Dataset path for Train and Test Data

Function for Maintaining Log Files

Function for Calculating the Time of a Process

Loading Train/Test CSV file

Printing the first five rows of the train dataset along with its columns

Exploratory Data Analysis (EDA)

The describe() method returns a statistical summary of the data in the DataFrame. For numerical columns, the summary contains, for each column: count, the number of non-empty values; mean, the average value; and std, the standard deviation.

In our case we have 4 numeric columns: rating, duration, xc_id and resampled_sampling_rate.
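For illustration, here is a minimal describe() call on a toy frame with two of those columns; the values are made up, not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for the train DataFrame; values are illustrative only.
df = pd.DataFrame({"rating": [4.5, 3.0, 5.0], "duration": [38, 112, 25]})

summary = df.describe()
print(summary.loc["count", "rating"])   # number of non-empty values: 3.0
print(summary.loc["mean", "duration"])  # average duration of the toy clips
```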

The info() method shows the index dtype, the columns, their non-null counts, and the memory usage of the dataset.

Check Null Values

In the cell above we write a custom function to check the null values in our dataset, including their percentage with respect to each column.
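One possible shape for such a helper (the function name and layout here are an assumption, not the notebook's actual code):

```python
import numpy as np
import pandas as pd

def null_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count nulls per column and their percentage of all rows."""
    counts = df.isnull().sum()
    return pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })

# Toy frame with some missing ratings.
demo = pd.DataFrame({"rating": [4.5, np.nan, 5.0, np.nan],
                     "duration": [38, 112, 25, 60]})
print(null_report(demo))
```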

Dataset Size

Our dataset has 21375 rows and 38 columns.

Column-wise Unique Values

Above we can see how many unique values each column contains.

Value Counts and Distribution Plot

A good way to understand the distribution of data in a column is to plot its distribution. We write a function plot_distribution, which will save us a lot of time.
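A possible sketch of such a helper; the notebook's actual plot_distribution also draws the plot (e.g. via a seaborn/matplotlib bar or dist plot), while this version only returns the counts so it stays self-contained:

```python
import pandas as pd

def plot_distribution(series: pd.Series, top_n: int = 10) -> pd.Series:
    """Return the top_n value counts of a column; in the notebook the
    result would also be passed to a plotting call."""
    return series.value_counts().head(top_n)

# Toy "type" column with mixed sound types.
s = pd.Series(["call", "song", "call", "flight", "call", "song"])
print(plot_distribution(s, top_n=2))
```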

Species Type Distribution

In the graph above we can see the number of species and the value counts of each species type.

Rating and Type Exploration

Now that we have the range, let's see if we have some great bird singers with us!

In the graph above we can see the rating distribution with respect to frequency, and we can clearly see that the most highly rated bird songs have a frequency of around 6000. Similarly, we can notice the trend that higher frequencies tend to receive higher ratings, and vice versa.

Analysis of Different Bird Sound Types

We can't directly plot the type column as multiple types are mixed together. One of the examples is shown below.

In the graph above we can see the top sound types and their frequencies.

Create some time features

We split the date column into three separate columns, year, month and day_of_month, so we can get more useful insights during EDA.
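With pandas this split is a few lines; the column name date is taken from the text, the sample values are illustrative:

```python
import pandas as pd

# Hypothetical date strings in the dataset's YYYY-MM-DD format.
df = pd.DataFrame({"date": ["2014-05-03", "2015-06-21", "2014-05-30"]})
dates = pd.to_datetime(df["date"])

df["year"] = dates.dt.year
df["month"] = dates.dt.month
df["day_of_month"] = dates.dt.day
print(df[["year", "month", "day_of_month"]])
```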

Droping Ambiguous Date Entries

Type casting of all columns: all data is stored as the object type, so we have to convert each column to its appropriate data type according to its nature.

Rows were also removed based on year: years with an insignificant number of recordings would provide little or no value to the study.
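The two cleaning steps above, coercing ambiguous dates and dropping rare years, can be sketched with pandas; the sample rows and the count threshold are illustrative assumptions:

```python
import pandas as pd

# Illustrative rows; "2014-00-00" is an ambiguous date that cannot be parsed.
df = pd.DataFrame({"date": ["2014-05-03", "2014-00-00", "1902-06-21", "2014-07-12"]})

# Coerce unparseable dates to NaT and drop them.
parsed = pd.to_datetime(df["date"], errors="coerce")
df = df.loc[parsed.notna()].copy()
df["year"] = parsed.loc[parsed.notna()].dt.year

# Drop years whose record count falls below an illustrative threshold.
year_counts = df["year"].value_counts()
keep_years = year_counts[year_counts >= 2].index
df = df[df["year"].isin(keep_years)]
print(df)
```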

Time of the Recording

In the graph above we can see that most recordings were made in 2014.

Audio File Registrations Made Per Month

In the graph above we can see the bird sound recordings by month, and we find that most recordings were made in May and June.

The Songs

We must be careful in how we interpret this column because it is one of the more erratic ones. Call, song, and flight are the most common song types.

From this bar chart we categorize the sound quality and find that for most recordings it is not specified.

Loading and Visualizing an audio file

Duration and Audio File Types

From the graph above, we can conclude that most of the recordings are 100 seconds long or shorter.

Sampling rate

Sampling rate, or sampling frequency, defines the number of samples per second.
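A quick numeric illustration of how sample count, sampling rate, and duration relate (the values are arbitrary):

```python
import numpy as np

sr = 44100                           # samples per second (a common audio rate)
duration_s = 2.5
y = np.zeros(int(sr * duration_s))   # a silent clip of that length

# Dividing the number of samples by the rate recovers the duration.
print(len(y) / sr)                   # 2.5 seconds
```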

From the interactive graph above we can see that most recordings have a sampling rate between 44100 Hz and 48000 Hz.

Bird Species

From the bird species distribution above we can see that the most frequent species is the Sand Martin.

Country:

Country in which the observation is made

From the graph above, we can see the countries with the most recordings; the USA is at the top.

Duration:

Duration of the observation

Geographical Analysis of the Birds:

From the geographical map above, we can see which parts of the world contributed the most recordings; we can interact with the map by hovering the mouse cursor.

Some Samples of Recordings

Above is a sample recording of amered.

Above is a sample recording of vesspa.

Feature Extraction from Bird Voices

The audio data is composed of:

  1. Sound
  2. Sample Rate

We load the voice sample of vesspa and print the shape of the audio array, the sample rate, and the length of the audio.

Trim leading and trailing silence from an audio signal (silence before and after the actual audio)
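A simplified numpy stand-in for that trimming step; librosa.effects.trim works on a frame-wise dB threshold, while this sketch uses a raw amplitude threshold, so it is an analogy rather than the library's exact behaviour:

```python
import numpy as np

def trim_silence(y: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Drop leading and trailing samples whose absolute amplitude stays
    below the threshold (a simplified analogue of librosa.effects.trim)."""
    loud = np.flatnonzero(np.abs(y) >= threshold)
    if loud.size == 0:
        return y[:0]                    # all-silence clip trims to empty
    return y[loud[0]:loud[-1] + 1]

y = np.array([0.0, 0.001, 0.5, -0.3, 0.2, 0.0, 0.0])
print(trim_silence(y))  # -> [ 0.5 -0.3  0.2]
```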

Above we load the sounds of amered, cangoo, haiwoo, pingro and vesspa and trim them.

Graphical Representation of Bird Sound Waves

Applying the Fourier Transform to Analyze the Sound Waves

A function that takes an input signal in the time domain and produces the signal's frequency breakdown. Both the y-axis (frequency) and the "colour" axis (amplitude) should be transformed to the log scale, which approximates human perception of amplitude.
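A minimal numpy sketch of that pipeline (framing, windowing, FFT, log-magnitude); the n_fft and hop values are illustrative, and the notebook presumably uses librosa's stft for the same computation:

```python
import numpy as np

def log_spectrogram(y: np.ndarray, n_fft: int = 256, hop: int = 128) -> np.ndarray:
    """Frame the signal, apply a Hann window and the FFT, and return
    log-scaled magnitudes (dB-like), one column per frame."""
    window = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * window
              for i in range(0, len(y) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq_bins, n_frames)
    return 20 * np.log10(mag + 1e-10)

sr = 8000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 1000 * t)   # a pure 1 kHz tone
S = log_spectrogram(y)
peak_bin = S[:, 0].argmax()
print(peak_bin * sr / 256)         # the peak lands near 1000 Hz
```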

Converting Sound Recordings to an Amplitude Spectrogram

From the five images above we can see the spectrograms of all five chosen bird species, and we notice that their frequencies show different patterns over time; for example, vesspa has high frequencies while cangoo shows a different pattern.

Mel Spectrogram: Sound Frequencies Converted to the Mel Scale
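The mel conversion itself is a closed-form formula; shown here in its HTK form (librosa also supports a Slaney variant, so this is one common convention, not necessarily the one used in the notebook):

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

print(hz_to_mel(1000))              # close to 1000 mel by construction
print(mel_to_hz(hz_to_mel(4000)))   # round-trips back to 4000 Hz
```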

From the mel spectrograms above we can clearly see the difference in frequencies between the different species.

Zero Crossing Rate: Transitions from positive to zero to negative or negative to zero to positive
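The quantity itself is simple to compute; a numpy sketch of what librosa.feature.zero_crossing_rate measures per frame (here over the whole signal at once):

```python
import numpy as np

def zero_crossing_rate(y: np.ndarray) -> float:
    """Fraction of consecutive sample pairs where the sign changes."""
    signs = np.signbit(y)
    return float(np.mean(signs[1:] != signs[:-1]))

y = np.array([0.3, -0.2, 0.5, 0.4, -0.1])
print(zero_crossing_rate(y))  # 3 sign changes over 4 pairs -> 0.75
```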

From the information above we can see that amered and vesspa have a high rate of transitions from positive to zero to negative, or negative to zero to positive.

Perceptual Correlate Of Waveform Periodicity

RMSE: Signal energy, calculates the square root of the mean square
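As a formula this is just the square root of the mean squared amplitude; a numpy sketch of what librosa.feature.rms computes per frame:

```python
import numpy as np

def rms_energy(y: np.ndarray) -> float:
    """Root-mean-square energy of a signal."""
    return float(np.sqrt(np.mean(y ** 2)))

print(rms_energy(np.array([3.0, -4.0])))  # sqrt((9 + 16) / 2)
```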

Single Bird Voice Analysis

Log-frequency spectrogram of the bird astfly; we notice that the frequency range is roughly 4000 to 8000.

From the spectrogram above, we can see that the value range is from 0.001 to 0.1.

EDA Ended

Model Implementation

torchlibrosa for Audio File Processing and Feature Extraction

torchlibrosa, a PyTorch-based implementation, is used to replace some of librosa's functions. Here I use some of torchlibrosa's functions.

If users previously trained on CPU-extracted features from librosa but want to add GPU acceleration during training and evaluation, TorchLibrosa provides almost identical features to the standard librosa functions (numerical difference less than 1e-5). Ref: https://github.com/qiuqiangkong/torchlibrosa

Audioset Tagging CNN

We also use the Cnn14_DecisionLevelAtt model from the Audioset Tagging CNN (PANNs) models, which is a SED (sound event detection) model.

Building blocks

What is good about the MASK_RCNN model (the notebook's class name for this SED network) is that it accepts a raw audio clip as input. Let's put a chunk into the CNN feature extractor of the model above.

In MASK_RCNN, the input raw waveform is converted into a log-mel spectrogram using torchlibrosa's utilities. I put this functionality in the MASK_RCNN.preprocess() method. Let's check the output.

The MASK_RCNN.cnn_feature_extractor() method takes this as input and outputs a feature map. Let's check the output of the feature extractor.

Although it is downsized through several convolution and pooling layers, its third dimension has size 15 and still contains time information. Each element of this dimension is a segment. In an SED model, we produce a prediction for each of these segments.

Train SED model with only weak supervision

[Figure: weak-label vs. strong-label annotation]


This figure gives an intuitive explanation of what weak annotation and strong annotation mean in sound event detection. For this competition, we only have weak annotation (clip-level annotation). Therefore, we need to train our SED model in a weakly-supervised manner.

In the weakly-supervised setting, we only have clip-level annotation, so we need to aggregate the model's output along the time axis as well. Hence, we first put a classifier that outputs a class-existence probability for each time step just after the feature extractor, and then aggregate the classifier's output along the time axis. In this way we get both a clip-level prediction and segment-level predictions (if the time resolution is high enough, these can be treated as event-level predictions). Then we train the model normally using BCE loss between the clip-level prediction and the clip-level annotation.
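The aggregation just described can be sketched in a few lines of numpy. The logit values below are made up for illustration; in the real model, both sets of logits come from 1-D convolutions over the feature map (as in the AttBlock shown later):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical logits for one clip, shaped (n_classes, n_segments).
att_logits = np.array([[0.1, 2.0, 0.2],    # attention scores per segment
                       [1.0, 0.0, 1.0]])
cla_logits = np.array([[0.5, 3.0, -1.0],   # per-segment class logits
                       [-2.0, -2.0, -2.0]])

norm_att = softmax(np.clip(att_logits, -10, 10), axis=-1)  # attention over time
segment_prob = sigmoid(cla_logits)                         # segment-level prediction
clip_prob = (norm_att * segment_prob).sum(axis=-1)         # clip-level prediction
print(clip_prob)  # this is what BCE compares against the clip-level labels
```

Note that the clip-level probability is a convex combination of the segment probabilities, so a single confident segment with high attention can dominate the clip prediction.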

Let's check how this is implemented in the PANNs model above. The segment-wise prediction and clip-wise prediction are actually calculated in the AttBlock of the model.

import torch
import torch.nn as nn

# init_layer and init_bn are weight-initialization helpers defined
# elsewhere in the notebook.
class AttBlock(nn.Module):
    def __init__(self,
                 in_features: int,
                 out_features: int,
                 activation="linear",
                 temperature=1.0):
        super().__init__()

        self.activation = activation
        self.temperature = temperature
        self.att = nn.Conv1d(
            in_channels=in_features,
            out_channels=out_features,
            kernel_size=1,
            stride=1,
            padding=0,
            bias=True)
        self.cla = nn.Conv1d(
            in_channels=in_features,
            out_channels=out_features,
            kernel_size=1,
            stride=1,
            padding=0,
            bias=True)

        self.bn_att = nn.BatchNorm1d(out_features)
        self.init_weights()

    def init_weights(self):
        init_layer(self.att)
        init_layer(self.cla)
        init_bn(self.bn_att)

    def forward(self, x):
        # x: (n_samples, n_in, n_time)
        norm_att = torch.softmax(torch.clamp(self.att(x), -10, 10), dim=-1)
        cla = self.nonlinear_transform(self.cla(x))
        x = torch.sum(norm_att * cla, dim=2)
        return x, norm_att, cla

    def nonlinear_transform(self, x):
        if self.activation == 'linear':
            return x
        elif self.activation == 'sigmoid':
            return torch.sigmoid(x)

In the forward method, it first calculates the self-attention map in the first line, norm_att = torch.softmax(torch.clamp(self.att(x), -10, 10), dim=-1). This is used to aggregate the per-segment classification results. In the second line, cla = self.nonlinear_transform(self.cla(x)) calculates the segment-wise classification result. Then in the third line, attention aggregation is performed to obtain the clip-wise prediction.

Now, let's try to train this model in weakly-supervised manner.

Dataset

Criterion

Callbacks

Train

Some code are taken from https://www.kaggle.com/ttahara/training-birdsong-baseline-resnest50-fast . Thanks @ttahara!

Seems it's learning something.

Now I'll show how this model works in the inference phase. I'll use a trained model of this architecture, which I trained myself on the data of this competition in my local environment.

Since several concerns have been expressed about over-sharing of top solutions during the competition, and since I respect those people who have worked hard to improve their scores, I will not make the trained weights public and will not share how I trained this model.

Prediction with SED model

Postprocess
